Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery. [http://machinelearningmastery.com/]

SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Sensorless Drive Diagnosis dataset presents a multi-class classification problem where we are trying to predict one of several possible outcomes.

INTRODUCTION: The dataset contains features extracted from electric current drive signals. The drive has both intact and defective components. The signals can belong to 11 different classes representing different conditions. Each condition was measured several times under 12 different operating conditions, such as speeds, load moments, and load forces.

In this iteration, we will establish the baseline accuracy measurement for comparison with future rounds of modeling.

ANALYSIS: The baseline performance of the machine learning algorithms achieved an average accuracy of 85.53%. Two algorithms (Random Forest and Gradient Boosting) achieved the top accuracy metrics after the first round of modeling. After a series of tuning trials, Random Forest turned in the top overall result with an accuracy metric of 99.92%. After applying the optimized parameters, the Random Forest algorithm processed the testing dataset with an accuracy of 99.90%, closely matching the accuracy estimated from the training data.

CONCLUSION: For this iteration, the Random Forest algorithm achieved the best overall training and validation results. For this dataset, Random Forest could be considered for further modeling.

Dataset Used: Sensorless Drive Diagnosis Data Set

Dataset ML Model: Multi-class classification with numerical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Dataset+for+Sensorless+Drive+Diagnosis

The project aims to touch on the following areas:

  1. Document a predictive modeling problem end-to-end.
  2. Explore data cleaning and transformation options.
  3. Explore non-ensemble and ensemble algorithms for baseline model performance.
  4. Explore algorithm tuning techniques for improving model performance.

Any predictive modeling machine learning project generally can be broken down into about six major tasks:

  1. Prepare Environment
  2. Summarize Data
  3. Prepare Data
  4. Model and Evaluate Algorithms
  5. Improve Accuracy or Results
  6. Finalize Model and Present Results

1. Prepare Environment

1.a) Load libraries and packages

startTimeScript <- proc.time()
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(corrplot)
## corrplot 0.84 loaded
library(DMwR)
## Loading required package: grid
## Registered S3 method overwritten by 'xts':
##   method     from
##   as.zoo.xts zoo
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(Hmisc)
## Loading required package: survival
## 
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
## 
##     cluster
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(ROCR)
## Loading required package: gplots
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
library(stringr)

1.b) Set up the controlling parameters and functions

# Create the random seed number for reproducible results
seedNum <- 888

# Set up the notifyStatus flag to control the sending of progress emails (setting it to TRUE will send status emails!)
notifyStatus <- TRUE
if (notifyStatus) library(mailR)
## Registered S3 method overwritten by 'R.oo':
##   method        from       
##   throw.default R.methodsS3
# Run algorithms using 10-fold cross validation
control <- trainControl(method="repeatedcv", number=10, repeats=1)
metricTarget <- "Accuracy"
# Set up the email notification function
email_notify <- function(msg=""){
  sender <- Sys.getenv("MAIL_SENDER")
  receiver <- Sys.getenv("MAIL_RECEIVER")
  gateway <- Sys.getenv("SMTP_GATEWAY")
  smtpuser <- Sys.getenv("SMTP_USERNAME")
  password <- Sys.getenv("SMTP_PASSWORD")
  sbj_line <- "Notification from R Multi-Class Classification Script"
  send.mail(
    from = sender,
    to = receiver,
    subject= sbj_line,
    body = msg,
    smtp = list(host.name = gateway, port = 587, user.name = smtpuser, passwd = password, ssl = TRUE),
    authenticate = TRUE,
    send = TRUE)
}
if (notifyStatus) email_notify(paste("Library and Data Loading has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@3b6eb2ec}"

1.c) Load dataset

# Slicing up the document path to get the final destination file name
dataset_path <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/00325/Sensorless_drive_diagnosis.txt'
doc_path_list <- str_split(dataset_path, "/")
dest_file <- doc_path_list[[1]][length(doc_path_list[[1]])]

if (!file.exists(dest_file)) {
  # Download the document from the website
  cat("Downloading", dataset_path, "as", dest_file, "\n")
  download.file(dataset_path, dest_file, mode = "wb")
  cat(dest_file, "downloaded!\n")
#  unzip(dest_file)
#  cat(dest_file, "unpacked!\n")
}

inputFile <- dest_file
colNames <- paste0("attr",1:48)
colNames <- c(colNames, 'targetVar')
Xy_original <- read.csv(inputFile, sep=' ', header=FALSE, col.names = colNames)
# Take a peek at the dataframe after the import
head(Xy_original)
##         attr1       attr2       attr3       attr4       attr5       attr6
## 1 -3.0146e-07  8.2603e-06 -1.1517e-05 -2.3098e-06 -1.4386e-06 -2.1225e-05
## 2  2.9132e-06 -5.2477e-06  3.3421e-06 -6.0561e-06  2.7789e-06 -3.7524e-06
## 3 -2.9517e-06 -3.1840e-06 -1.5920e-05 -1.2084e-06 -1.5753e-06  1.7394e-05
## 4 -1.3226e-06  8.8201e-06 -1.5879e-05 -4.8111e-06 -7.2829e-07  4.1439e-06
## 5 -6.8366e-08  5.6663e-07 -2.5906e-05 -6.4901e-06 -7.9406e-07  1.3491e-05
## 6 -9.5849e-07  5.2143e-08 -4.7359e-05  6.4537e-07 -2.3041e-06  5.4999e-05
##      attr7    attr8    attr9    attr10    attr11    attr12     attr13
## 1 0.031718 0.031710 0.031721 -0.032963 -0.032962 -0.032941 0.00076881
## 2 0.030804 0.030810 0.030806 -0.033520 -0.033522 -0.033519 0.00076614
## 3 0.032877 0.032880 0.032896 -0.029834 -0.029832 -0.029849 0.00076385
## 4 0.029410 0.029401 0.029417 -0.030156 -0.030155 -0.030159 0.00076950
## 5 0.030119 0.030119 0.030145 -0.031393 -0.031392 -0.031405 0.00076335
## 6 0.031154 0.031154 0.031201 -0.032789 -0.032787 -0.032842 0.00076713
##       attr14     attr15     attr16     attr17     attr18  attr19  attr20
## 1 0.00023244 0.00059982 0.00075698 0.00024722 0.00072498 0.89669 0.89669
## 2 0.00022071 0.00048534 0.00075479 0.00025208 0.00066780 0.89583 0.89583
## 3 0.00022992 0.00056024 0.00075789 0.00023620 0.00071163 0.89583 0.89583
## 4 0.00024423 0.00075301 0.00075545 0.00025668 0.00075448 0.89480 0.89481
## 5 0.00024924 0.00062287 0.00075629 0.00022513 0.00061220 0.89656 0.89656
## 6 0.00025203 0.00064273 0.00075793 0.00026632 0.00073583 0.89458 0.89458
##    attr21  attr22  attr23  attr24      attr25    attr26   attr27
## 1 0.89669 0.89658 0.89658 0.89656  0.00768040  0.257360 -0.71184
## 2 0.89580 0.89677 0.89677 0.89673 -0.00940220 -0.059481 -0.29592
## 3 0.89581 0.89619 0.89619 0.89621  0.00595100 -0.075239 -0.22862
## 4 0.89479 0.89576 0.89576 0.89572  0.00205630  0.466570  0.56841
## 5 0.89655 0.89521 0.89520 0.89520 -0.00086017 -0.904870 -0.57395
## 6 0.89455 0.89573 0.89572 0.89572  0.00048305  0.164030 -0.13124
##       attr28    attr29   attr30     attr31     attr32     attr33
## 1 0.00487890 -0.095775 -0.44126 -0.0013168 -0.0013189 -0.0012477
## 2 0.00711140  0.119110  0.31117  0.0010932  0.0010911  0.0010682
## 3 0.00044468 -0.162300  0.56210  0.0028942  0.0029030  0.0028851
## 4 0.00693590 -0.467240  0.22673 -0.0012546 -0.0012421 -0.0012774
## 5 0.00561650  0.343380  0.84307 -0.0038112 -0.0038040 -0.0038000
## 6 0.00129750 -0.048862  2.21850 -0.0028981 -0.0028984 -0.0028680
##        attr34      attr35      attr36   attr37  attr38  attr39   attr40
## 1 -0.00437770 -0.00438410 -0.00438930 -0.66732  4.3662  6.0168 -0.63308
## 2 -0.00134400 -0.00134190 -0.00137550 -0.65404  1.3977  3.6048 -0.59314
## 3  0.00035014  0.00035803  0.00037366 -0.67146  2.8072  5.8007 -0.63252
## 4 -0.00497380 -0.00496550 -0.00497560 -0.67766  7.8629 23.3960 -0.62289
## 5 -0.00465540 -0.00464600 -0.00463950 -0.65867 14.8720  5.0582 -0.63010
## 6 -0.00151920 -0.00151880 -0.00140970 -0.65298  7.3162  3.9757 -0.61124
##   attr41  attr42  attr43  attr44  attr45  attr46  attr47  attr48 targetVar
## 1 2.9646  8.1198 -1.4961 -1.4961 -1.4961 -1.4996 -1.4996 -1.4996         1
## 2 7.6252  6.1690 -1.4967 -1.4967 -1.4967 -1.5005 -1.5005 -1.5005         1
## 3 2.7784  5.3017 -1.4983 -1.4983 -1.4982 -1.4985 -1.4985 -1.4985         1
## 4 6.5534  6.2606 -1.4963 -1.4963 -1.4963 -1.4975 -1.4975 -1.4976         1
## 5 4.5155  9.5231 -1.4958 -1.4958 -1.4958 -1.4959 -1.4959 -1.4959         1
## 6 5.8337 18.6970 -1.4956 -1.4956 -1.4956 -1.4973 -1.4972 -1.4973         1
sapply(Xy_original, class)
##     attr1     attr2     attr3     attr4     attr5     attr6     attr7 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##     attr8     attr9    attr10    attr11    attr12    attr13    attr14 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr15    attr16    attr17    attr18    attr19    attr20    attr21 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr22    attr23    attr24    attr25    attr26    attr27    attr28 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr29    attr30    attr31    attr32    attr33    attr34    attr35 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr36    attr37    attr38    attr39    attr40    attr41    attr42 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr43    attr44    attr45    attr46    attr47    attr48 targetVar 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "integer"
sapply(Xy_original, function(x) sum(is.na(x)))
##     attr1     attr2     attr3     attr4     attr5     attr6     attr7 
##         0         0         0         0         0         0         0 
##     attr8     attr9    attr10    attr11    attr12    attr13    attr14 
##         0         0         0         0         0         0         0 
##    attr15    attr16    attr17    attr18    attr19    attr20    attr21 
##         0         0         0         0         0         0         0 
##    attr22    attr23    attr24    attr25    attr26    attr27    attr28 
##         0         0         0         0         0         0         0 
##    attr29    attr30    attr31    attr32    attr33    attr34    attr35 
##         0         0         0         0         0         0         0 
##    attr36    attr37    attr38    attr39    attr40    attr41    attr42 
##         0         0         0         0         0         0         0 
##    attr43    attr44    attr45    attr46    attr47    attr48 targetVar 
##         0         0         0         0         0         0         0

1.d) Data Cleaning

# Convert columns from one data type to another
Xy_original$targetVar <- as.factor(Xy_original$targetVar)
# Take a peek at the dataframe after the cleaning
head(Xy_original)
##         attr1       attr2       attr3       attr4       attr5       attr6
## 1 -3.0146e-07  8.2603e-06 -1.1517e-05 -2.3098e-06 -1.4386e-06 -2.1225e-05
## 2  2.9132e-06 -5.2477e-06  3.3421e-06 -6.0561e-06  2.7789e-06 -3.7524e-06
## 3 -2.9517e-06 -3.1840e-06 -1.5920e-05 -1.2084e-06 -1.5753e-06  1.7394e-05
## 4 -1.3226e-06  8.8201e-06 -1.5879e-05 -4.8111e-06 -7.2829e-07  4.1439e-06
## 5 -6.8366e-08  5.6663e-07 -2.5906e-05 -6.4901e-06 -7.9406e-07  1.3491e-05
## 6 -9.5849e-07  5.2143e-08 -4.7359e-05  6.4537e-07 -2.3041e-06  5.4999e-05
##      attr7    attr8    attr9    attr10    attr11    attr12     attr13
## 1 0.031718 0.031710 0.031721 -0.032963 -0.032962 -0.032941 0.00076881
## 2 0.030804 0.030810 0.030806 -0.033520 -0.033522 -0.033519 0.00076614
## 3 0.032877 0.032880 0.032896 -0.029834 -0.029832 -0.029849 0.00076385
## 4 0.029410 0.029401 0.029417 -0.030156 -0.030155 -0.030159 0.00076950
## 5 0.030119 0.030119 0.030145 -0.031393 -0.031392 -0.031405 0.00076335
## 6 0.031154 0.031154 0.031201 -0.032789 -0.032787 -0.032842 0.00076713
##       attr14     attr15     attr16     attr17     attr18  attr19  attr20
## 1 0.00023244 0.00059982 0.00075698 0.00024722 0.00072498 0.89669 0.89669
## 2 0.00022071 0.00048534 0.00075479 0.00025208 0.00066780 0.89583 0.89583
## 3 0.00022992 0.00056024 0.00075789 0.00023620 0.00071163 0.89583 0.89583
## 4 0.00024423 0.00075301 0.00075545 0.00025668 0.00075448 0.89480 0.89481
## 5 0.00024924 0.00062287 0.00075629 0.00022513 0.00061220 0.89656 0.89656
## 6 0.00025203 0.00064273 0.00075793 0.00026632 0.00073583 0.89458 0.89458
##    attr21  attr22  attr23  attr24      attr25    attr26   attr27
## 1 0.89669 0.89658 0.89658 0.89656  0.00768040  0.257360 -0.71184
## 2 0.89580 0.89677 0.89677 0.89673 -0.00940220 -0.059481 -0.29592
## 3 0.89581 0.89619 0.89619 0.89621  0.00595100 -0.075239 -0.22862
## 4 0.89479 0.89576 0.89576 0.89572  0.00205630  0.466570  0.56841
## 5 0.89655 0.89521 0.89520 0.89520 -0.00086017 -0.904870 -0.57395
## 6 0.89455 0.89573 0.89572 0.89572  0.00048305  0.164030 -0.13124
##       attr28    attr29   attr30     attr31     attr32     attr33
## 1 0.00487890 -0.095775 -0.44126 -0.0013168 -0.0013189 -0.0012477
## 2 0.00711140  0.119110  0.31117  0.0010932  0.0010911  0.0010682
## 3 0.00044468 -0.162300  0.56210  0.0028942  0.0029030  0.0028851
## 4 0.00693590 -0.467240  0.22673 -0.0012546 -0.0012421 -0.0012774
## 5 0.00561650  0.343380  0.84307 -0.0038112 -0.0038040 -0.0038000
## 6 0.00129750 -0.048862  2.21850 -0.0028981 -0.0028984 -0.0028680
##        attr34      attr35      attr36   attr37  attr38  attr39   attr40
## 1 -0.00437770 -0.00438410 -0.00438930 -0.66732  4.3662  6.0168 -0.63308
## 2 -0.00134400 -0.00134190 -0.00137550 -0.65404  1.3977  3.6048 -0.59314
## 3  0.00035014  0.00035803  0.00037366 -0.67146  2.8072  5.8007 -0.63252
## 4 -0.00497380 -0.00496550 -0.00497560 -0.67766  7.8629 23.3960 -0.62289
## 5 -0.00465540 -0.00464600 -0.00463950 -0.65867 14.8720  5.0582 -0.63010
## 6 -0.00151920 -0.00151880 -0.00140970 -0.65298  7.3162  3.9757 -0.61124
##   attr41  attr42  attr43  attr44  attr45  attr46  attr47  attr48 targetVar
## 1 2.9646  8.1198 -1.4961 -1.4961 -1.4961 -1.4996 -1.4996 -1.4996         1
## 2 7.6252  6.1690 -1.4967 -1.4967 -1.4967 -1.5005 -1.5005 -1.5005         1
## 3 2.7784  5.3017 -1.4983 -1.4983 -1.4982 -1.4985 -1.4985 -1.4985         1
## 4 6.5534  6.2606 -1.4963 -1.4963 -1.4963 -1.4975 -1.4975 -1.4976         1
## 5 4.5155  9.5231 -1.4958 -1.4958 -1.4958 -1.4959 -1.4959 -1.4959         1
## 6 5.8337 18.6970 -1.4956 -1.4956 -1.4956 -1.4973 -1.4972 -1.4973         1
sapply(Xy_original, class)
##     attr1     attr2     attr3     attr4     attr5     attr6     attr7 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##     attr8     attr9    attr10    attr11    attr12    attr13    attr14 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr15    attr16    attr17    attr18    attr19    attr20    attr21 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr22    attr23    attr24    attr25    attr26    attr27    attr28 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr29    attr30    attr31    attr32    attr33    attr34    attr35 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr36    attr37    attr38    attr39    attr40    attr41    attr42 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr43    attr44    attr45    attr46    attr47    attr48 targetVar 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"  "factor"
sapply(Xy_original, function(x) sum(is.na(x)))
##     attr1     attr2     attr3     attr4     attr5     attr6     attr7 
##         0         0         0         0         0         0         0 
##     attr8     attr9    attr10    attr11    attr12    attr13    attr14 
##         0         0         0         0         0         0         0 
##    attr15    attr16    attr17    attr18    attr19    attr20    attr21 
##         0         0         0         0         0         0         0 
##    attr22    attr23    attr24    attr25    attr26    attr27    attr28 
##         0         0         0         0         0         0         0 
##    attr29    attr30    attr31    attr32    attr33    attr34    attr35 
##         0         0         0         0         0         0         0 
##    attr36    attr37    attr38    attr39    attr40    attr41    attr42 
##         0         0         0         0         0         0         0 
##    attr43    attr44    attr45    attr46    attr47    attr48 targetVar 
##         0         0         0         0         0         0         0

1.e) Splitting Data into Training and Test Sets

# Use variable totCol to hold the number of columns in the dataframe
totCol <- ncol(Xy_original)

# Set up variable totAttr for the total number of attribute columns
totAttr <- totCol-1
# targetCol variable indicates the column location of the target/class variable
# If the first column, set targetCol to 1. If the last column, set targetCol to totCol
# If targetCol != 1 and targetCol != totCol, be aware when slicing up the dataframes for visualization!
targetCol <- totCol

# Standardize the class column to the name of targetVar if applicable
# colnames(Xy_original)[targetCol] <- "targetVar"
# Create various sub-datasets for visualization and cleaning/transformation operations.
set.seed(seedNum)

# Use 75% of the data to train the models and the remaining for testing/validation
training_index <- createDataPartition(Xy_original$targetVar, p=0.75, list=FALSE)
Xy_train <- Xy_original[training_index,]
Xy_test <- Xy_original[-training_index,]

if (targetCol==1) {
  X_train <- Xy_train[,(targetCol+1):totCol]
  y_train <- Xy_train[,targetCol]
  y_test <- Xy_test[,targetCol]
} else {
  X_train <- Xy_train[,1:(totAttr)]
  y_train <- Xy_train[,totCol]
  y_test <- Xy_test[,totCol]
}

1.f) Set up the parameters for data visualization

# Set up the number of rows and columns for the visualization display. dispRow * dispCol should be >= totAttr
dispCol <- 3
if (totAttr %% dispCol == 0) {
  dispRow <- totAttr %/% dispCol
} else {
  dispRow <- (totAttr %/% dispCol) + 1
}
cat("Will attempt to create graphics grid (col x row): ", dispCol, ' by ', dispRow)
## Will attempt to create graphics grid (col x row):  3  by  16
if (notifyStatus) email_notify(paste("Library and Data Loading completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@30c7da1e}"

2. Summarize Data

To gain a better understanding of the data that we have on hand, we will leverage a number of descriptive statistics and data visualization techniques. The plan is to use the results to consider new questions, review assumptions, and validate hypotheses that we can investigate later with specialized models.

if (notifyStatus) email_notify(paste("Data Summarization and Visualization has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@2812cbfa}"

2.a) Descriptive statistics

2.a.i) Peek at the data itself

head(Xy_train)
##         attr1       attr2       attr3       attr4       attr5       attr6
## 1 -3.0146e-07  8.2603e-06 -1.1517e-05 -2.3098e-06 -1.4386e-06 -2.1225e-05
## 2  2.9132e-06 -5.2477e-06  3.3421e-06 -6.0561e-06  2.7789e-06 -3.7524e-06
## 4 -1.3226e-06  8.8201e-06 -1.5879e-05 -4.8111e-06 -7.2829e-07  4.1439e-06
## 5 -6.8366e-08  5.6663e-07 -2.5906e-05 -6.4901e-06 -7.9406e-07  1.3491e-05
## 8 -2.5666e-06 -1.6795e-07  1.4838e-05 -1.5984e-06  8.7092e-07  1.4961e-05
## 9 -5.4740e-06  1.0865e-07 -1.0972e-05 -1.8156e-06  4.7578e-07  2.3783e-05
##      attr7    attr8    attr9    attr10    attr11    attr12     attr13
## 1 0.031718 0.031710 0.031721 -0.032963 -0.032962 -0.032941 0.00076881
## 2 0.030804 0.030810 0.030806 -0.033520 -0.033522 -0.033519 0.00076614
## 4 0.029410 0.029401 0.029417 -0.030156 -0.030155 -0.030159 0.00076950
## 5 0.030119 0.030119 0.030145 -0.031393 -0.031392 -0.031405 0.00076335
## 8 0.031071 0.031071 0.031056 -0.029695 -0.029696 -0.029711 0.00076645
## 9 0.031309 0.031308 0.031319 -0.031344 -0.031345 -0.031368 0.00076907
##       attr14     attr15     attr16     attr17     attr18  attr19  attr20
## 1 0.00023244 0.00059982 0.00075698 0.00024722 0.00072498 0.89669 0.89669
## 2 0.00022071 0.00048534 0.00075479 0.00025208 0.00066780 0.89583 0.89583
## 4 0.00024423 0.00075301 0.00075545 0.00025668 0.00075448 0.89480 0.89481
## 5 0.00024924 0.00062287 0.00075629 0.00022513 0.00061220 0.89656 0.89656
## 8 0.00022783 0.00065398 0.00075673 0.00025627 0.00067641 0.89635 0.89634
## 9 0.00022925 0.00054730 0.00075881 0.00024424 0.00071279 0.89718 0.89718
##    attr21  attr22  attr23  attr24      attr25    attr26   attr27    attr28
## 1 0.89669 0.89658 0.89658 0.89656  0.00768040  0.257360 -0.71184 0.0048789
## 2 0.89580 0.89677 0.89677 0.89673 -0.00940220 -0.059481 -0.29592 0.0071114
## 4 0.89479 0.89576 0.89576 0.89572  0.00205630  0.466570  0.56841 0.0069359
## 5 0.89655 0.89521 0.89520 0.89520 -0.00086017 -0.904870 -0.57395 0.0056165
## 8 0.89634 0.89864 0.89864 0.89862  0.00202540 -1.080900  0.40767 0.0014612
## 9 0.89715 0.89650 0.89649 0.89649  0.01032800 -0.036135 -0.26122 0.0075963
##      attr29   attr30     attr31     attr32     attr33     attr34
## 1 -0.095775 -0.44126 -0.0013168 -0.0013189 -0.0012477 -0.0043777
## 2  0.119110  0.31117  0.0010932  0.0010911  0.0010682 -0.0013440
## 4 -0.467240  0.22673 -0.0012546 -0.0012421 -0.0012774 -0.0049738
## 5  0.343380  0.84307 -0.0038112 -0.0038040 -0.0038000 -0.0046554
## 8 -0.487020  0.52455 -0.0023039 -0.0022918 -0.0023070 -0.0033377
## 9  0.064140  0.54750 -0.0027887 -0.0027917 -0.0027475 -0.0025581
##       attr35     attr36   attr37  attr38  attr39   attr40  attr41 attr42
## 1 -0.0043841 -0.0043893 -0.66732  4.3662  6.0168 -0.63308  2.9646 8.1198
## 2 -0.0013419 -0.0013755 -0.65404  1.3977  3.6048 -0.59314  7.6252 6.1690
## 4 -0.0049655 -0.0049756 -0.67766  7.8629 23.3960 -0.62289  6.5534 6.2606
## 5 -0.0046460 -0.0046395 -0.65867 14.8720  5.0582 -0.63010  4.5155 9.5231
## 8 -0.0033242 -0.0033195 -0.64483 11.9410  9.8584 -0.63618 12.8380 4.9079
## 9 -0.0025556 -0.0024461 -0.63549  3.0478  3.0311 -0.62465  2.9841 6.6883
##    attr43  attr44  attr45  attr46  attr47  attr48 targetVar
## 1 -1.4961 -1.4961 -1.4961 -1.4996 -1.4996 -1.4996         1
## 2 -1.4967 -1.4967 -1.4967 -1.5005 -1.5005 -1.5005         1
## 4 -1.4963 -1.4963 -1.4963 -1.4975 -1.4975 -1.4976         1
## 5 -1.4958 -1.4958 -1.4958 -1.4959 -1.4959 -1.4959         1
## 8 -1.4990 -1.4990 -1.4990 -1.4968 -1.4968 -1.4968         1
## 9 -1.4955 -1.4955 -1.4955 -1.4947 -1.4947 -1.4947         1

2.a.ii) Dimensions of the dataset

dim(Xy_train)
## [1] 43890    49

2.a.iii) Types of the attributes

sapply(Xy_train, class)
##     attr1     attr2     attr3     attr4     attr5     attr6     attr7 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##     attr8     attr9    attr10    attr11    attr12    attr13    attr14 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr15    attr16    attr17    attr18    attr19    attr20    attr21 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr22    attr23    attr24    attr25    attr26    attr27    attr28 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr29    attr30    attr31    attr32    attr33    attr34    attr35 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr36    attr37    attr38    attr39    attr40    attr41    attr42 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr43    attr44    attr45    attr46    attr47    attr48 targetVar 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"  "factor"

2.a.iv) Statistical summary of the attributes

summary(Xy_train)
##      attr1                attr2                attr3           
##  Min.   :-1.372e-02   Min.   :-5.414e-03   Min.   :-1.358e-02  
##  1st Qu.:-7.414e-06   1st Qu.:-1.446e-05   1st Qu.:-7.285e-05  
##  Median :-2.643e-06   Median : 8.840e-07   Median : 5.700e-07  
##  Mean   :-3.427e-06   Mean   : 1.271e-06   Mean   : 1.140e-06  
##  3rd Qu.: 1.596e-06   3rd Qu.: 1.872e-05   3rd Qu.: 7.467e-05  
##  Max.   : 5.784e-03   Max.   : 4.525e-03   Max.   : 5.238e-03  
##                                                                
##      attr4                attr5                attr6           
##  Min.   :-1.279e-02   Min.   :-8.356e-03   Min.   :-9.741e-03  
##  1st Qu.:-5.416e-06   1st Qu.:-1.466e-05   1st Qu.:-7.310e-05  
##  Median :-1.072e-06   Median : 7.730e-07   Median : 5.400e-08  
##  Mean   :-1.436e-06   Mean   : 1.280e-06   Mean   : 5.590e-07  
##  3rd Qu.: 3.536e-06   3rd Qu.: 1.914e-05   3rd Qu.: 7.155e-05  
##  Max.   : 1.453e-03   Max.   : 8.245e-04   Max.   : 2.754e-03  
##                                                                
##      attr7               attr8               attr9          
##  Min.   :-0.139890   Min.   :-0.135940   Min.   :-0.130860  
##  1st Qu.:-0.019952   1st Qu.:-0.019958   1st Qu.:-0.019925  
##  Median : 0.013206   Median : 0.013215   Median : 0.013237  
##  Mean   : 0.001904   Mean   : 0.001903   Mean   : 0.001902  
##  3rd Qu.: 0.024751   3rd Qu.: 0.024754   3rd Qu.: 0.024765  
##  Max.   : 0.068515   Max.   : 0.068515   Max.   : 0.068509  
##                                                             
##      attr10             attr11             attr12        
##  Min.   :-0.21725   Min.   :-0.21725   Min.   :-0.21727  
##  1st Qu.:-0.03217   1st Qu.:-0.03218   1st Qu.:-0.03219  
##  Median :-0.01561   Median :-0.01564   Median :-0.01566  
##  Mean   :-0.01190   Mean   :-0.01190   Mean   :-0.01191  
##  3rd Qu.: 0.02067   3rd Qu.: 0.02067   3rd Qu.: 0.02068  
##  Max.   : 0.35258   Max.   : 0.35256   Max.   : 0.35263  
##                                                          
##      attr13             attr14              attr15         
##  Min.   :0.000751   Min.   :0.0001888   Min.   :0.0003542  
##  1st Qu.:0.001136   1st Qu.:0.0005991   1st Qu.:0.0012535  
##  Median :0.002199   Median :0.0011846   Median :0.0029788  
##  Mean   :0.001878   Mean   :0.0010849   Mean   :0.0030942  
##  3rd Qu.:0.002525   3rd Qu.:0.0014568   3rd Qu.:0.0043396  
##  Max.   :0.136570   Max.   :0.0515430   Max.   :0.1039300  
##                                                            
##      attr16              attr17              attr18         
##  Min.   :0.0007467   Min.   :0.0001889   Min.   :0.0003642  
##  1st Qu.:0.0011392   1st Qu.:0.0005972   1st Qu.:0.0012856  
##  Median :0.0021875   Median :0.0011824   Median :0.0028925  
##  Mean   :0.0018685   Mean   :0.0010793   Mean   :0.0030768  
##  3rd Qu.:0.0025227   3rd Qu.:0.0014535   3rd Qu.:0.0043180  
##  Max.   :0.1087700   Max.   :0.0647640   Max.   :0.0785300  
##                                                             
##      attr19           attr20           attr21           attr22      
##  Min.   :0.7979   Min.   :0.7979   Min.   :0.7979   Min.   :0.7984  
##  1st Qu.:1.3275   1st Qu.:1.3274   1st Qu.:1.3268   1st Qu.:1.3289  
##  Median :1.5729   Median :1.5728   Median :1.5726   Median :1.5723  
##  Mean   :1.6183   Mean   :1.6182   Mean   :1.6178   Mean   :1.6178  
##  3rd Qu.:1.8856   3rd Qu.:1.8855   3rd Qu.:1.8848   3rd Qu.:1.8831  
##  Max.   :2.3757   Max.   :2.3754   Max.   :2.3750   Max.   :2.3728  
##                                                                     
##      attr23           attr24           attr25          
##  Min.   :0.7984   Min.   :0.7984   Min.   :-15.796000  
##  1st Qu.:1.3289   1st Qu.:1.3282   1st Qu.: -0.006039  
##  Median :1.5723   Median :1.5721   Median :  0.003068  
##  Mean   :1.6177   Mean   :1.6172   Mean   :  0.001857  
##  3rd Qu.:1.8831   3rd Qu.:1.8823   3rd Qu.:  0.011598  
##  Max.   :2.3726   Max.   :2.3715   Max.   : 28.285000  
##                                                        
##      attr26               attr27              attr28          
##  Min.   :-12.351000   Min.   :-7.959000   Min.   :-11.903000  
##  1st Qu.: -0.208978   1st Qu.:-0.454458   1st Qu.: -0.009192  
##  Median :  0.006090   Median :-0.002214   Median :  0.000180  
##  Mean   :  0.005814   Mean   :-0.003648   Mean   :  0.000050  
##  3rd Qu.:  0.220148   3rd Qu.: 0.445820   3rd Qu.:  0.008727  
##  Max.   : 12.437000   Max.   : 9.580300   Max.   : 18.294000  
##                                                               
##      attr29              attr30              attr31          
##  Min.   :-12.50800   Min.   :-9.976600   Min.   :-0.0502350  
##  1st Qu.: -0.20397   1st Qu.:-0.450455   1st Qu.:-0.0051046  
##  Median :  0.00778   Median :-0.002608   Median : 0.0004445  
##  Mean   :  0.01392   Mean   :-0.009182   Mean   :-0.0000006  
##  3rd Qu.:  0.22640   3rd Qu.: 0.431915   3rd Qu.: 0.0051417  
##  Max.   : 10.97700   Max.   : 8.764000   Max.   : 0.0863780  
##                                                              
##      attr32               attr33               attr34          
##  Min.   :-5.189e-02   Min.   :-5.279e-02   Min.   :-0.3377100  
##  1st Qu.:-5.114e-03   1st Qu.:-5.110e-03   1st Qu.:-0.0044650  
##  Median : 4.404e-04   Median : 4.581e-04   Median :-0.0002848  
##  Mean   :-2.350e-06   Mean   : 2.270e-06   Mean   :-0.0000209  
##  3rd Qu.: 5.141e-03   3rd Qu.: 5.148e-03   3rd Qu.: 0.0049518  
##  Max.   : 8.646e-02   Max.   : 8.655e-02   Max.   : 0.1948200  
##                                                                
##      attr35               attr36               attr37        
##  Min.   :-0.3377000   Min.   :-0.3377500   Min.   :  -0.906  
##  1st Qu.:-0.0044693   1st Qu.:-0.0044516   1st Qu.:  -0.715  
##  Median :-0.0002833   Median :-0.0002797   Median :  -0.664  
##  Mean   :-0.0000240   Mean   :-0.0000187   Mean   :  -0.403  
##  3rd Qu.: 0.0049415   3rd Qu.: 0.0049538   3rd Qu.:  -0.581  
##  Max.   : 0.1902000   Max.   : 0.1850300   Max.   :4015.400  
##                                                              
##      attr38             attr39             attr40        
##  Min.   : -0.6126   Min.   :  0.5222   Min.   :  -0.902  
##  1st Qu.:  1.4893   1st Qu.:  4.4520   1st Qu.:  -0.715  
##  Median :  3.2989   Median :  6.5580   Median :  -0.661  
##  Mean   :  7.4795   Mean   :  8.4185   Mean   :  -0.322  
##  3rd Qu.:  8.3635   3rd Qu.:  9.9641   3rd Qu.:  -0.574  
##  Max.   :312.5200   Max.   :265.3300   Max.   :3670.800  
##                                                          
##      attr41             attr42             attr43           attr44      
##  Min.   : -0.5968   Min.   :  0.3207   Min.   :-1.526   Min.   :-1.526  
##  1st Qu.:  1.4490   1st Qu.:  4.4429   1st Qu.:-1.503   1st Qu.:-1.503  
##  Median :  3.3085   Median :  6.4859   Median :-1.500   Median :-1.500  
##  Mean   :  7.3015   Mean   :  8.2926   Mean   :-1.501   Mean   :-1.501  
##  3rd Qu.:  8.3260   3rd Qu.:  9.8585   3rd Qu.:-1.498   3rd Qu.:-1.498  
##  Max.   :889.9300   Max.   :153.1500   Max.   :-1.458   Max.   :-1.456  
##                                                                         
##      attr45           attr46           attr47           attr48      
##  Min.   :-1.524   Min.   :-1.521   Min.   :-1.523   Min.   :-1.521  
##  1st Qu.:-1.503   1st Qu.:-1.500   1st Qu.:-1.500   1st Qu.:-1.500  
##  Median :-1.500   Median :-1.498   Median :-1.498   Median :-1.498  
##  Mean   :-1.501   Mean   :-1.498   Mean   :-1.498   Mean   :-1.498  
##  3rd Qu.:-1.498   3rd Qu.:-1.496   3rd Qu.:-1.496   3rd Qu.:-1.496  
##  Max.   :-1.456   Max.   :-1.337   Max.   :-1.337   Max.   :-1.337  
##                                                                     
##    targetVar    
##  1      : 3990  
##  2      : 3990  
##  3      : 3990  
##  4      : 3990  
##  5      : 3990  
##  6      : 3990  
##  (Other):19950

2.a.v) Summarize the levels of the class attribute

cbind(freq=table(y_train), percentage=prop.table(table(y_train))*100)
##    freq percentage
## 1  3990   9.090909
## 2  3990   9.090909
## 3  3990   9.090909
## 4  3990   9.090909
## 5  3990   9.090909
## 6  3990   9.090909
## 7  3990   9.090909
## 8  3990   9.090909
## 9  3990   9.090909
## 10 3990   9.090909
## 11 3990   9.090909

2.b) Data visualizations

# Boxplots for each attribute
# par(mfrow=c(dispRow,dispCol))
for(i in 1:totAttr) {
    boxplot(X_train[,i], main=names(X_train)[i])
}

# Histograms for each attribute
# par(mfrow=c(dispRow,dispCol))
for(i in 1:totAttr) {
    hist(X_train[,i], main=names(X_train)[i])
}

# Density plot for each attribute
# par(mfrow=c(dispRow,dispCol))
for(i in 1:totAttr) {
    plot(density(X_train[,i]), main=names(X_train)[i])
}

# Correlation matrix
correlations <- cor(X_train)
corrplot(correlations, method="circle")
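As a possible follow-up to the correlation plot, caret's findCorrelation() can flag attributes whose pairwise correlation exceeds a cutoff. This is a sketch only; the 0.90 cutoff is an arbitrary choice, and no attributes are removed in this iteration.

```r
# Sketch: list attributes that are highly correlated with another attribute.
# findCorrelation() returns the column indices it suggests for removal.
highCorr <- findCorrelation(correlations, cutoff=0.90)
cat("Attributes flagged at the 0.90 cutoff:", names(X_train)[highCorr], "\n")
```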

if (notifyStatus) email_notify(paste("Data Summarization and Visualization completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@2f7a2457}"

3. Prepare Data

Some datasets may require additional preparation activities that best expose the structure of the problem and the relationships between the input attributes and the output variable. Typical data-prep tasks include data cleaning, data transforms, and feature selection.
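Although no transforms are applied in this iteration, a centre-and-scale transform with caret's preProcess() would look like the sketch below. The X_test object name is an assumption, following the X_train/y_train naming used elsewhere in the script.

```r
# Sketch only -- not applied in this iteration of the project.
# preProcess() learns the transform parameters from the training attributes;
# predict() then applies the same transform to both splits.
preProcParams <- preProcess(X_train, method=c("center", "scale"))
X_train_scaled <- predict(preProcParams, X_train)
X_test_scaled <- predict(preProcParams, X_test)
```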

if (notifyStatus) email_notify(paste("Data Cleaning and Transformation has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@5ebec15}"

3.a) Data Transforms

# Not applicable for this iteration of the project

3.b) Feature Selection

# Not applicable for this iteration of the project

3.c) Display the Final Datasets for Model-Building

# We finalize the training and testing datasets for the modeling activities
dim(Xy_train)
## [1] 43890    49
dim(Xy_test)
## [1] 14619    49
if (notifyStatus) email_notify(paste("Data Cleaning and Transformation completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@38082d64}"
proc.time()-startTimeScript
##    user  system elapsed 
##  36.698   1.325  46.868

4. Model and Evaluate Algorithms

After the data-prep, we next work on finding a workable model by evaluating a subset of machine learning algorithms that are good at exploiting the structure of the training data. The typical evaluation tasks include defining a test harness, spot-checking a diverse set of algorithms, and comparing their estimated accuracy.

For this project, we will evaluate one linear, one non-linear, and three ensemble algorithms:

Linear Algorithm: Linear Discriminant Analysis

Non-Linear Algorithm: Decision Trees (CART)

Ensemble Algorithms: Bagged CART, Random Forest, and Gradient Boosting

The random number seed is reset before each run to ensure that each algorithm is evaluated using the same data splits, which makes the results directly comparable.
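The `control`, `metricTarget`, and `seedNum` objects are defined earlier in the script. A minimal definition consistent with the "Cross-Validated (10 fold, repeated 1 times)" lines printed by the models below would be the following sketch; the seed value shown is a placeholder, not the one actually used.

```r
# Assumed setup -- the actual definitions appear earlier in the script.
# 10-fold cross-validation, repeated once, scored on accuracy.
control <- trainControl(method="repeatedcv", number=10, repeats=1)
metricTarget <- "Accuracy"
seedNum <- 888  # placeholder; use the seed defined at the top of the script
```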

4.a) Generate models using linear algorithms

startModeling <- proc.time()
# Linear Discriminant Analysis (Classification)
# if (notifyStatus) email_notify(paste("Linear Discriminant Analysis modeling has begun!",date()))
# startTimeModule <- proc.time()
# set.seed(seedNum)
# fit.lda <- train(targetVar~., data=Xy_train, method="lda", metric=metricTarget, trControl=control)
# print(fit.lda)
# proc.time()-startTimeModule
# if (notifyStatus) email_notify(paste("Linear Discriminant Analysis modeling completed!",date()))

4.b) Generate models using nonlinear algorithms

# Decision Tree - CART (Regression/Classification)
if (notifyStatus) email_notify(paste("Decision Tree modeling has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@180bc464}"
startTimeModule <- proc.time()
set.seed(seedNum)
fit.cart <- train(targetVar~., data=Xy_train, method="rpart", metric=metricTarget, trControl=control)
print(fit.cart)
## CART 
## 
## 43890 samples
##    48 predictor
##    11 classes: '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 39501, 39501, 39501, 39501, 39501, 39501, ... 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy    Kappa    
##   0.09927318  0.43572568  0.3792982
##   0.09994987  0.23627250  0.1598997
##   0.10000000  0.09090909  0.0000000
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.09927318.
proc.time()-startTimeModule
##    user  system elapsed 
##  57.317   1.619  57.854
if (notifyStatus) email_notify(paste("Decision Tree modeling completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@2d554825}"

4.c) Generate models using ensemble algorithms

In this section, we will explore the use and tuning of ensemble algorithms to see whether we can improve the results.

# Bagged CART (Regression/Classification)
if (notifyStatus) email_notify(paste("Bagged CART modeling has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@4909b8da}"
startTimeModule <- proc.time()
set.seed(seedNum)
fit.bagcart <- train(targetVar~., data=Xy_train, method="treebag", metric=metricTarget, trControl=control)
print(fit.bagcart)
## Bagged CART 
## 
## 43890 samples
##    48 predictor
##    11 classes: '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 39501, 39501, 39501, 39501, 39501, 39501, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9879471  0.9867419
proc.time()-startTimeModule
##     user   system  elapsed 
## 1101.061   27.519 1100.999
if (notifyStatus) email_notify(paste("Bagged CART modeling completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@54a097cc}"
# Random Forest (Regression/Classification)
if (notifyStatus) email_notify(paste("Random Forest modeling has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@50f8360d}"
startTimeModule <- proc.time()
set.seed(seedNum)
fit.rf <- train(targetVar~., data=Xy_train, method="rf", metric=metricTarget, trControl=control)
print(fit.rf)
## Random Forest 
## 
## 43890 samples
##    48 predictor
##    11 classes: '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 39501, 39501, 39501, 39501, 39501, 39501, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9990203  0.9989223
##   25    0.9932559  0.9925815
##   48    0.9898154  0.9887970
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
proc.time()-startTimeModule
##     user   system  elapsed 
## 5315.285   11.920 5336.994
if (notifyStatus) email_notify(paste("Random Forest modeling completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@337d0578}"
# Gradient Boosting (Regression/Classification)
if (notifyStatus) email_notify(paste("Gradient Boosting modeling has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@2669b199}"
startTimeModule <- proc.time()
set.seed(seedNum)
# fit.gbm <- train(targetVar~., data=Xy_train, method="gbm", metric=metricTarget, trControl=control, verbose=F)
fit.gbm <- train(targetVar~., data=Xy_train, method="xgbTree", metric=metricTarget, trControl=control, verbose=F)
print(fit.gbm)
## eXtreme Gradient Boosting 
## 
## 43890 samples
##    48 predictor
##    11 classes: '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 39501, 39501, 39501, 39501, 39501, 39501, ... 
## Resampling results across tuning parameters:
## 
##   eta  max_depth  colsample_bytree  subsample  nrounds  Accuracy 
##   0.3  1          0.6               0.50        50      0.9186603
##   0.3  1          0.6               0.50       100      0.9614719
##   0.3  1          0.6               0.50       150      0.9755753
##   0.3  1          0.6               0.75        50      0.9204375
##   0.3  1          0.6               0.75       100      0.9609934
##   0.3  1          0.6               0.75       150      0.9750740
##   0.3  1          0.6               1.00        50      0.9202552
##   0.3  1          0.6               1.00       100      0.9621554
##   0.3  1          0.6               1.00       150      0.9752335
##   0.3  1          0.8               0.50        50      0.9180679
##   0.3  1          0.8               0.50       100      0.9606972
##   0.3  1          0.8               0.50       150      0.9749146
##   0.3  1          0.8               0.75        50      0.9200729
##   0.3  1          0.8               0.75       100      0.9616541
##   0.3  1          0.8               0.75       150      0.9754614
##   0.3  1          0.8               1.00        50      0.9203008
##   0.3  1          0.8               1.00       100      0.9616769
##   0.3  1          0.8               1.00       150      0.9754386
##   0.3  2          0.6               0.50        50      0.9860333
##   0.3  2          0.6               0.50       100      0.9959216
##   0.3  2          0.6               0.50       150      0.9975621
##   0.3  2          0.6               0.75        50      0.9855320
##   0.3  2          0.6               0.75       100      0.9957393
##   0.3  2          0.6               0.75       150      0.9977216
##   0.3  2          0.6               1.00        50      0.9856459
##   0.3  2          0.6               1.00       100      0.9963317
##   0.3  2          0.6               1.00       150      0.9980406
##   0.3  2          0.8               0.50        50      0.9854181
##   0.3  2          0.8               0.50       100      0.9955571
##   0.3  2          0.8               0.50       150      0.9974482
##   0.3  2          0.8               0.75        50      0.9856004
##   0.3  2          0.8               0.75       100      0.9961267
##   0.3  2          0.8               0.75       150      0.9977899
##   0.3  2          0.8               1.00        50      0.9856687
##   0.3  2          0.8               1.00       100      0.9962634
##   0.3  2          0.8               1.00       150      0.9978355
##   0.3  3          0.6               0.50        50      0.9951697
##   0.3  3          0.6               0.50       100      0.9982912
##   0.3  3          0.6               0.50       150      0.9987697
##   0.3  3          0.6               0.75        50      0.9959444
##   0.3  3          0.6               0.75       100      0.9985190
##   0.3  3          0.6               0.75       150      0.9987241
##   0.3  3          0.6               1.00        50      0.9962178
##   0.3  3          0.6               1.00       100      0.9986102
##   0.3  3          0.6               1.00       150      0.9988152
##   0.3  3          0.8               0.50        50      0.9956254
##   0.3  3          0.8               0.50       100      0.9982000
##   0.3  3          0.8               0.50       150      0.9985418
##   0.3  3          0.8               0.75        50      0.9959444
##   0.3  3          0.8               0.75       100      0.9986102
##   0.3  3          0.8               0.75       150      0.9987924
##   0.3  3          0.8               1.00        50      0.9960811
##   0.3  3          0.8               1.00       100      0.9984735
##   0.3  3          0.8               1.00       150      0.9988152
##   0.4  1          0.6               0.50        50      0.9419002
##   0.4  1          0.6               0.50       100      0.9714969
##   0.4  1          0.6               0.50       150      0.9825245
##   0.4  1          0.6               0.75        50      0.9436090
##   0.4  1          0.6               0.75       100      0.9715881
##   0.4  1          0.6               0.75       150      0.9824106
##   0.4  1          0.6               1.00        50      0.9442242
##   0.4  1          0.6               1.00       100      0.9719070
##   0.4  1          0.6               1.00       150      0.9828207
##   0.4  1          0.8               0.50        50      0.9424470
##   0.4  1          0.8               0.50       100      0.9714741
##   0.4  1          0.8               0.50       150      0.9829346
##   0.4  1          0.8               0.75        50      0.9438596
##   0.4  1          0.8               0.75       100      0.9715197
##   0.4  1          0.8               0.75       150      0.9828663
##   0.4  1          0.8               1.00        50      0.9443153
##   0.4  1          0.8               1.00       100      0.9720893
##   0.4  1          0.8               1.00       150      0.9828890
##   0.4  2          0.6               0.50        50      0.9919116
##   0.4  2          0.6               0.50       100      0.9968330
##   0.4  2          0.6               0.50       150      0.9978811
##   0.4  2          0.6               0.75        50      0.9917521
##   0.4  2          0.6               0.75       100      0.9976532
##   0.4  2          0.6               0.75       150      0.9982912
##   0.4  2          0.6               1.00        50      0.9923217
##   0.4  2          0.6               1.00       100      0.9976988
##   0.4  2          0.6               1.00       150      0.9984507
##   0.4  2          0.8               0.50        50      0.9917065
##   0.4  2          0.8               0.50       100      0.9972431
##   0.4  2          0.8               0.50       150      0.9979950
##   0.4  2          0.8               0.75        50      0.9918432
##   0.4  2          0.8               0.75       100      0.9975849
##   0.4  2          0.8               0.75       150      0.9985190
##   0.4  2          0.8               1.00        50      0.9921394
##   0.4  2          0.8               1.00       100      0.9977216
##   0.4  2          0.8               1.00       150      0.9985190
##   0.4  3          0.6               0.50        50      0.9972659
##   0.4  3          0.6               0.50       100      0.9984735
##   0.4  3          0.6               0.50       150      0.9986329
##   0.4  3          0.6               0.75        50      0.9974937
##   0.4  3          0.6               0.75       100      0.9987013
##   0.4  3          0.6               0.75       150      0.9987697
##   0.4  3          0.6               1.00        50      0.9977444
##   0.4  3          0.6               1.00       100      0.9987241
##   0.4  3          0.6               1.00       150      0.9988608
##   0.4  3          0.8               0.50        50      0.9971975
##   0.4  3          0.8               0.50       100      0.9985418
##   0.4  3          0.8               0.50       150      0.9987013
##   0.4  3          0.8               0.75        50      0.9974254
##   0.4  3          0.8               0.75       100      0.9987013
##   0.4  3          0.8               0.75       150      0.9988608
##   0.4  3          0.8               1.00        50      0.9978583
##   0.4  3          0.8               1.00       100      0.9987241
##   0.4  3          0.8               1.00       150      0.9987697
##   Kappa    
##   0.9105263
##   0.9576190
##   0.9731328
##   0.9124812
##   0.9570927
##   0.9725815
##   0.9122807
##   0.9583709
##   0.9727569
##   0.9098747
##   0.9567669
##   0.9724060
##   0.9120802
##   0.9578195
##   0.9730075
##   0.9123308
##   0.9578446
##   0.9729825
##   0.9846366
##   0.9955138
##   0.9973183
##   0.9840852
##   0.9953133
##   0.9974937
##   0.9842105
##   0.9959649
##   0.9978446
##   0.9839599
##   0.9951128
##   0.9971930
##   0.9841604
##   0.9957393
##   0.9975689
##   0.9842356
##   0.9958897
##   0.9976190
##   0.9946867
##   0.9981203
##   0.9986466
##   0.9955388
##   0.9983709
##   0.9985965
##   0.9958396
##   0.9984712
##   0.9986967
##   0.9951880
##   0.9980201
##   0.9983960
##   0.9955388
##   0.9984712
##   0.9986717
##   0.9956892
##   0.9983208
##   0.9986967
##   0.9360902
##   0.9686466
##   0.9807769
##   0.9379699
##   0.9687469
##   0.9806516
##   0.9386466
##   0.9690977
##   0.9811028
##   0.9366917
##   0.9686216
##   0.9812281
##   0.9382456
##   0.9686717
##   0.9811529
##   0.9387469
##   0.9692982
##   0.9811779
##   0.9911028
##   0.9965163
##   0.9976692
##   0.9909273
##   0.9974185
##   0.9981203
##   0.9915539
##   0.9974687
##   0.9982957
##   0.9908772
##   0.9969674
##   0.9977945
##   0.9910276
##   0.9973434
##   0.9983709
##   0.9913534
##   0.9974937
##   0.9983709
##   0.9969925
##   0.9983208
##   0.9984962
##   0.9972431
##   0.9985714
##   0.9986466
##   0.9975188
##   0.9985965
##   0.9987469
##   0.9969173
##   0.9983960
##   0.9985714
##   0.9971679
##   0.9985714
##   0.9987469
##   0.9976441
##   0.9985965
##   0.9986466
## 
## Tuning parameter 'gamma' was held constant at a value of 0
## 
## Tuning parameter 'min_child_weight' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 150, max_depth = 3,
##  eta = 0.4, gamma = 0, colsample_bytree = 0.6, min_child_weight = 1
##  and subsample = 1.
proc.time()-startTimeModule
##      user    system   elapsed 
## 42529.733   131.003 21614.814
if (notifyStatus) email_notify(paste("Gradient Boosting modeling completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@3c756e4d}"

4.d) Compare baseline algorithms

results <- resamples(list(CART=fit.cart, BagCART=fit.bagcart, RF=fit.rf, GBM=fit.gbm))
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: CART, BagCART, RF, GBM 
## Number of resamples: 10 
## 
## Accuracy 
##              Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## CART    0.3634085 0.4534062 0.4535202 0.4357257 0.4540328 0.4543176    0
## BagCART 0.9854181 0.9863295 0.9876965 0.9879471 0.9895762 0.9906585    0
## RF      0.9979494 0.9988608 0.9992026 0.9990203 0.9993165 0.9995443    0
## GBM     0.9977216 0.9988608 0.9988608 0.9988608 0.9990886 0.9995443    0
## 
## Kappa 
##              Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## CART    0.2997494 0.3987469 0.3988722 0.3792982 0.3994361 0.3997494    0
## BagCART 0.9839599 0.9849624 0.9864662 0.9867419 0.9885338 0.9897243    0
## RF      0.9977444 0.9987469 0.9991228 0.9989223 0.9992481 0.9994987    0
## GBM     0.9974937 0.9987469 0.9987469 0.9987469 0.9989975 0.9994987    0
dotplot(results)

# LDA was not modeled in this iteration, so only the four fitted models are averaged
cat('The average accuracy from all models is:',
    mean(c(results$values$`CART~Accuracy`,results$values$`BagCART~Accuracy`,results$values$`RF~Accuracy`,results$values$`GBM~Accuracy`)),'\n')
## The average accuracy from all models is: 0.8553885
cat('Total training time for all models:',proc.time()-startModeling)
## Total training time for all models: 49008.17 172.165 28127.53 0 0

5. Improve Accuracy or Results

After we achieve a short list of machine learning algorithms with a good level of accuracy, we can leverage ways to improve their accuracy further.

Using the two best-performing algorithms from the previous section, we will search for a combination of parameters for each algorithm that yields the best results.

5.a) Algorithm Tuning

Finally, we will tune the best-performing algorithms from each group further and see whether we can get more accuracy out of them.

# Tuning algorithm #1 - Random Forest
if (notifyStatus) email_notify(paste("Algorithm #1 tuning has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@4439f31e}"
startTimeModule <- proc.time()
set.seed(seedNum)
grid <- expand.grid(mtry = c(2, 13, 25, 36, 48))
fit.final1 <- train(targetVar~., data=Xy_train, method="rf", metric=metricTarget, tuneGrid=grid, trControl=control)
plot(fit.final1)

print(fit.final1)
## Random Forest 
## 
## 43890 samples
##    48 predictor
##    11 classes: '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 39501, 39501, 39501, 39501, 39501, 39501, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9990203  0.9989223
##   13    0.9964912  0.9961404
##   25    0.9932787  0.9926065
##   36    0.9913192  0.9904511
##   48    0.9899294  0.9889223
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
proc.time()-startTimeModule
##      user    system   elapsed 
## 10260.884    19.209 10296.140
if (notifyStatus) email_notify(paste("Algorithm #1 tuning completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@2a2d45ba}"
# Tuning algorithm #2 - Gradient Boosting
if (notifyStatus) email_notify(paste("Algorithm #2 tuning has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@457e2f02}"
startTimeModule <- proc.time()
set.seed(seedNum)
grid <- expand.grid(nrounds=c(100, 150, 200, 300), max_depth=3, eta=0.4, gamma=0, colsample_bytree=0.6, min_child_weight=1, subsample=1)
fit.final2 <- train(targetVar~., data=Xy_train, method="xgbTree", metric=metricTarget, tuneGrid=grid, trControl=control, verbose=F)
plot(fit.final2)

print(fit.final2)
## eXtreme Gradient Boosting 
## 
## 43890 samples
##    48 predictor
##    11 classes: '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 39501, 39501, 39501, 39501, 39501, 39501, ... 
## Resampling results across tuning parameters:
## 
##   nrounds  Accuracy   Kappa    
##   100      0.9988152  0.9986967
##   150      0.9989975  0.9988972
##   200      0.9989747  0.9988722
##   300      0.9989975  0.9988972
## 
## Tuning parameter 'max_depth' was held constant at a value of 3
## Tuning parameter 'eta' was held constant at a value of 0.4
## Tuning parameter 'gamma' was held constant at a value of 0
## Tuning parameter 'colsample_bytree' was held constant at a value of 0.6
## Tuning parameter 'min_child_weight' was held constant at a value of 1
## Tuning parameter 'subsample' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 150, max_depth = 3,
##  eta = 0.4, gamma = 0, colsample_bytree = 0.6, min_child_weight = 1
##  and subsample = 1.
proc.time()-startTimeModule
##     user   system  elapsed 
## 2274.833    3.913 1150.048
if (notifyStatus) email_notify(paste("Algorithm #2 tuning completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@cb5822}"

5.d) Compare Algorithms After Tuning

results <- resamples(list(RF=fit.final1, GBM=fit.final2))
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: RF, GBM 
## Number of resamples: 10 
## 
## Accuracy 
##          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## RF  0.9977216 0.9988608 0.9992026 0.9990203 0.9993165 0.9995443    0
## GBM 0.9979494 0.9987469 0.9990886 0.9989975 0.9993165 0.9995443    0
## 
## Kappa 
##          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## RF  0.9974937 0.9987469 0.9991228 0.9989223 0.9992481 0.9994987    0
## GBM 0.9977444 0.9986216 0.9989975 0.9988972 0.9992481 0.9994987    0
dotplot(results)

6. Finalize Model and Present Results

Once we have narrowed down to a model that we believe can make accurate predictions on unseen data, we are ready to finalize it. Finalizing a model may involve sub-tasks such as making predictions on the validation dataset, creating a standalone model on the entire training dataset, and saving the model for later use.

if (notifyStatus) email_notify(paste("Model Validation and Final Model Creation has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@28d25987}"

6.a) Predictions on validation dataset

predictions <- predict(fit.final1, newdata=Xy_test)
confusionMatrix(predictions, y_test)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    1    2    3    4    5    6    7    8    9   10   11
##         1  1327    0    0    0    0    0    0    0    0    0    0
##         2     0 1327    0    0    0    0    0    0    4    2    0
##         3     0    0 1329    0    0    0    0    0    0    0    0
##         4     0    0    0 1329    0    0    0    0    0    0    0
##         5     0    0    0    0 1327    0    0    0    0    0    0
##         6     2    0    0    0    0 1329    0    0    0    0    0
##         7     0    0    0    0    0    0 1329    0    0    0    0
##         8     0    0    0    0    2    0    0 1329    1    0    0
##         9     0    0    0    0    0    0    0    0 1323    0    0
##         10    0    2    0    0    0    0    0    0    1 1327    0
##         11    0    0    0    0    0    0    0    0    0    0 1329
## 
## Overall Statistics
##                                           
##                Accuracy : 0.999           
##                  95% CI : (0.9984, 0.9995)
##     No Information Rate : 0.0909          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9989          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6
## Sensitivity           0.99850  0.99850  1.00000  1.00000  0.99850  1.00000
## Specificity           1.00000  0.99955  1.00000  1.00000  1.00000  0.99985
## Pos Pred Value        1.00000  0.99550  1.00000  1.00000  1.00000  0.99850
## Neg Pred Value        0.99985  0.99985  1.00000  1.00000  0.99985  1.00000
## Prevalence            0.09091  0.09091  0.09091  0.09091  0.09091  0.09091
## Detection Rate        0.09077  0.09077  0.09091  0.09091  0.09077  0.09091
## Detection Prevalence  0.09077  0.09118  0.09091  0.09091  0.09077  0.09105
## Balanced Accuracy     0.99925  0.99902  1.00000  1.00000  0.99925  0.99992
##                      Class: 7 Class: 8 Class: 9 Class: 10 Class: 11
## Sensitivity           1.00000  1.00000  0.99549   0.99850   1.00000
## Specificity           1.00000  0.99977  1.00000   0.99977   1.00000
## Pos Pred Value        1.00000  0.99775  1.00000   0.99774   1.00000
## Neg Pred Value        1.00000  1.00000  0.99955   0.99985   1.00000
## Prevalence            0.09091  0.09091  0.09091   0.09091   0.09091
## Detection Rate        0.09091  0.09091  0.09050   0.09077   0.09091
## Detection Prevalence  0.09091  0.09111  0.09050   0.09098   0.09091
## Balanced Accuracy     1.00000  0.99989  0.99774   0.99913   1.00000
predictions <- predict(fit.final2, newdata=Xy_test)
confusionMatrix(predictions, y_test)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    1    2    3    4    5    6    7    8    9   10   11
##         1  1326    0    0    0    0    0    0    0    1    0    0
##         2     0 1327    0    0    0    0    0    0    1    1    0
##         3     0    0 1328    5    0    0    0    0    2    0    0
##         4     0    0    0 1324    0    0    0    0    0    0    0
##         5     0    0    0    0 1328    0    0    0    0    0    0
##         6     3    0    1    0    0 1329    0    0    0    0    0
##         7     0    0    0    0    0    0 1329    0    0    0    0
##         8     0    0    0    0    1    0    0 1329    0    0    0
##         9     0    0    0    0    0    0    0    0 1325    0    0
##         10    0    2    0    0    0    0    0    0    0 1328    0
##         11    0    0    0    0    0    0    0    0    0    0 1329
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9988          
##                  95% CI : (0.9981, 0.9993)
##     No Information Rate : 0.0909          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9987          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6
## Sensitivity           0.99774  0.99850  0.99925  0.99624  0.99925  1.00000
## Specificity           0.99992  0.99985  0.99947  1.00000  1.00000  0.99970
## Pos Pred Value        0.99925  0.99850  0.99476  1.00000  1.00000  0.99700
## Neg Pred Value        0.99977  0.99985  0.99992  0.99962  0.99992  1.00000
## Prevalence            0.09091  0.09091  0.09091  0.09091  0.09091  0.09091
## Detection Rate        0.09070  0.09077  0.09084  0.09057  0.09084  0.09091
## Detection Prevalence  0.09077  0.09091  0.09132  0.09057  0.09084  0.09118
## Balanced Accuracy     0.99883  0.99917  0.99936  0.99812  0.99962  0.99985
##                      Class: 7 Class: 8 Class: 9 Class: 10 Class: 11
## Sensitivity           1.00000  1.00000  0.99699   0.99925   1.00000
## Specificity           1.00000  0.99992  1.00000   0.99985   1.00000
## Pos Pred Value        1.00000  0.99925  1.00000   0.99850   1.00000
## Neg Pred Value        1.00000  1.00000  0.99970   0.99992   1.00000
## Prevalence            0.09091  0.09091  0.09091   0.09091   0.09091
## Detection Rate        0.09091  0.09091  0.09064   0.09084   0.09091
## Detection Prevalence  0.09091  0.09098  0.09064   0.09098   0.09091
## Balanced Accuracy     1.00000  0.99996  0.99850   0.99955   1.00000
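Rather than reading the figures off the printed report, the headline metrics can also be pulled out of the confusionMatrix object directly, e.g.:

```r
# confusionMatrix() returns a list; $overall is a named vector that holds
# the summary statistics printed above.
cm <- confusionMatrix(predictions, y_test)
cm$overall["Accuracy"]
cm$overall["Kappa"]
```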

6.b) Create standalone model on entire training dataset

startTimeModule <- proc.time()
set.seed(seedNum)

# Combining datasets to form a complete dataset that will be used to train the final model
Xy_complete <- rbind(Xy_train, Xy_test)

# library(randomForest)
# finalModel <- randomForest(targetVar~., Xy_complete, mtry=2, na.action=na.omit)
# summary(finalModel)
proc.time()-startTimeModule
##    user  system elapsed 
##   0.047   0.000   0.047

6.c) Save model for later use

#saveRDS(finalModel, "./finalModel_MultiClass.rds")
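If the saveRDS() call above is enabled, the model can be restored in a later session with readRDS() and used for prediction straight away (a sketch, assuming the .rds file exists):

```r
# Sketch: reload the saved model and score new observations with it.
finalModel <- readRDS("./finalModel_MultiClass.rds")
predictions <- predict(finalModel, newdata=Xy_test)
```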
if (notifyStatus) email_notify(paste("Model Validation and Final Model Creation Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@59f99ea}"
proc.time()-startTimeScript
##      user    system   elapsed 
## 61587.366   196.695 39635.617